Given the high relevance of COVID-19 in 2020 and the impact it has on the vast majority of individuals around the globe, staying informed regarding the current status and spread of COVID around the world has become a daily task for many people. While the severity of the situation necessitates daily monitoring and continuous tracking of the virus, generating enormous amounts of data each day, this also makes it easy for the general population to be overwhelmed by the overabundance of COVID-related data made available from the scientific community. When dealing with such the issue of big datasets, data visualization can serve the critical function of making large data more accessible and comprehensible to the public and the decision makers at large. In our project, we will explore a few of these different graphical concepts that may facilitate a better understanding of the large amounts of information and see how such concepts can turn simple numbers into a relevant narrative.
In this project, we use datasets from two different sources.
The first dataset, containing racial and ethnic data on COVID-19 cases in the United States, comes from a collaboration between COVID Tracking Project and Boston University Center for Antiracist Research. This racial COVID data include the number of cases and deaths for different races and ethnicities that are reported across all 50 states. For this first dataset, marginal histogram is used to visualize whether COVID-19 affects different races disproportionately across the US.
The second dataset comes from Our World in Data, which gathers COVID-19 data from various government agencies across the globe. In addition to reporting more basic measures such as the number of cases, deaths, hospitalization, testing, etc., this dataset also includes social measures such as the government stringency index and the human development index. The government stringency index is calculated by considering various restrictions the government imposes on the citizens such as wearing a mask in public, workplace closure, international travel ban, etc. The human development index is calculated by considering various measures of achievement in key dimensions of human development in the country such as life expectancy and standard of living.
With this dataset, we’ll first demonstrate the usage of bubble plots. Bubble plot allows plotting of more than 2 variables with the weight or size of the bubble serving as the 3rd axis, and will be used to visualize the relationship between these social measures and the number of COVID-19 deaths in a country.
Using the same dataset, we’ll also demonstrate how to create an interactive plot. Interactive plot allows users to manipulate the ways in which the data can be viewed. Some of the ways in which one can interact with the plot may be as simple as being able to zoom in and out of different sections of the plot, include or exclude certain variables in the plot, or click on a particular bar or point to see the actual value. We will walk through the implementation of these mentioned interactive functions into our plots. This interactive plot will be a bar graph showing the number of COVID-19 cases, deaths, hospitalizations, and testing numbers in 5 different Eastern European countries.
In this tutorial, we are using R, STATA, MATLAB and Python to show how these graphical concepts may be implemented.
| Software | Relevant Libraries |
|---|---|
| R | tidyverse (manipulation of data), ggplot2 (plotting), Cowplot (plotting), ggplotly (adding interactive functionality) |
| STATA | the built-in command scatter and twoway histogram used to create the three subplots in marginal plots, and user-written command grc1leg is used to combine these subplots into one figure with a command legend. Scatter is also used to create the bubble plot. |
| Python | numpy (manipulation of data) , pandas (manipulation of data), matplotlib (plotting), seaborn (plotting), plotly (adding interactive functionality) |
| MATLAB | Clickable Legend Toolbox from MATLAB File Exchange (adding interactive functionality) + built-in base functions |
scatterhistogramclickableLegend# setup
library(tidyverse)
filename_racial = "./data/Race Data Entry - CRDT.csv"
racial_data = read_delim(
filename_racial, delim = ",",
col_types = cols(
.default = col_integer(),
Date = col_character(),
State = col_character()
)
) %>%
pivot_longer(
cols = starts_with(c("Cases", "Deaths")),
names_pattern = "(Cases|Deaths)_(.*)",
names_to = c(".value", "Race")
) %>%
filter(!str_detect(Race, "Total|Ethnicity"))
date_picked = "20201101"
race_picked = c("White", "Black", "LatinX", "Asian")
plot1_data = racial_data %>%
filter(
Date == date_picked,
str_detect(Race, paste0(race_picked, collapse = "|"))
) %>%
mutate(
Race = factor(Race, levels = race_picked, ordered = TRUE)
)
| Date | State | Race | Cases | Deaths |
|---|---|---|---|---|
| 20201101 | AK | White | 5063 | 31 |
| 20201101 | AK | Black | 574 | 3 |
| 20201101 | AK | LatinX | NA | NA |
| 20201101 | AK | Asian | 680 | 8 |
| 20201101 | AL | White | 59056 | 1555 |
| Date | State | Race | Cases | Deaths |
|---|---|---|---|---|
| 0/224 | 0/224 | 0/224 | 51/224 | 54/224 |
plot1_title = sprintf("%s (%s)",
"Total confirmed COVID-19 deaths vs. cases, U.S. States",
format(as.Date(date_picked, "%Y%m%d"), "%m/%d/%y")
)
palette_picked = "Set1"
pic1_main = plot1_data %>%
ggplot(aes(x = Cases, y = Deaths, color = Race)) +
theme_bw() +
geom_point(size = 2, alpha = .7) +
scale_color_brewer(palette = palette_picked) +
ggtitle(plot1_title) +
theme(
plot.title = element_text(hjust = 0.5, size = 12),
axis.title = element_text(size = 9)
)
pic1_main
pic1 = pic1_main +
theme(legend.position = "left")
ggExtra::ggMarginal(
pic1, type="histogram",
groupColour = TRUE, groupFill = TRUE
)
We can create marginal boxplots by setting type = "boxplot":
ggExtra::ggMarginal(
pic1, type="boxplot",
groupColour = TRUE, groupFill = TRUE
)
We can also specify the relative size of the main plot to the marginal ones using parameter size:
ggExtra::ggMarginal(
pic1, type="boxplot", size = 7,
groupColour = TRUE, groupFill = TRUE
)
A size of 5 means that the main plot is 5x wider and 5x taller than the marginal plots.
Although ggMarginal() provides a easy solution to marginal plots, it is not friendly to customization. For example, we cannot set the two marginal plots to be of different types or of different sizes. Neither can we specify the exact size for marginal plots in units such as inch.
Check out the Cowplot section for a more flexible solution to creating marginal plots in R.
Using package Cowplot to create marginal plots is less succinct, but more flexible. Unlike ggExtra::ggMarginal, Cowplot allows:
pic1_xhist = cowplot::axis_canvas(pic1_main, axis = "x") +
geom_histogram(
data = plot1_data %>% filter(!is.na(Cases) & !is.na(Deaths)),
aes(x = Cases, fill = Race, color = Race),
bins = 30, alpha = .6
) +
scale_fill_brewer(palette = palette_picked) +
scale_color_brewer(palette = palette_picked)
pic1_yhist = cowplot::axis_canvas(pic1_main, axis = "y", coord_flip = TRUE) +
geom_histogram(
data = plot1_data %>% filter(!is.na(Cases) & !is.na(Deaths)),
aes(x = Deaths, fill = Race, color = Race),
bins = 30, alpha = .6
) +
scale_fill_brewer(palette = palette_picked) +
scale_color_brewer(palette = palette_picked) +
coord_flip()
pic1_xhist
pic1_yhist
{pic1_main +
theme(legend.position = "left")
} %>%
cowplot::insert_xaxis_grob(
pic1_xhist,
height = grid::unit(.8, "in"),
position = "top"
) %>%
cowplot::insert_yaxis_grob(
pic1_yhist,
width = grid::unit(.8, "in"),
position = "right"
) %>%
cowplot::ggdraw()
We can move the marginal plots from the top/right side of the original figure to the bottom/left side.
To do that, we may want to flip the two previously created marginal histograms using scale_y_reverse():
pic1_xhist + scale_y_reverse()
To make room for the marginal plots, we can move the x-axis and y-axis to the top/right side of the original figure by setting the position parameter in scale_*_continuous():
{pic1_main +
scale_x_continuous(position = "top") +
scale_y_continuous(position = "right")
} %>%
cowplot::insert_xaxis_grob(
pic1_xhist + scale_y_reverse(),
height = grid::unit(.8, "in"),
position = "bottom"
) %>%
cowplot::insert_yaxis_grob(
pic1_yhist + scale_y_reverse(),
width = grid::unit(.8, "in"),
position = "left"
) %>%
cowplot::ggdraw()
pic1_ybox = cowplot::axis_canvas(pic1_main, axis = "y") +
geom_boxplot(
data = plot1_data %>% filter(!is.na(Cases) & !is.na(Deaths)),
aes(x = 0, y = Deaths, fill = Race, color = Race),
alpha = .6
) +
scale_fill_brewer(palette = palette_picked) +
scale_color_brewer(palette = palette_picked)
{pic1_main +
theme(legend.position = "left")
} %>%
cowplot::insert_xaxis_grob(
pic1_xhist,
height = grid::unit(.8, "in"),
position = "top"
) %>%
cowplot::insert_yaxis_grob(
pic1_ybox,
width = grid::unit(.8, "in"),
position = "right"
) %>%
cowplot::ggdraw()
We can even call cowplot::insert_yaxis_grob() twice to add multiple margin plots for the y axis (although probably unnecessary).
cd "E:\git\Stats506_midterm_project\STATA\"
import delimited "data\Race Data Entry - CRDT.csv", ///
stringcols(1) numericcols(3/28) clear
keep if date == "20201101"
keep state *white *black *latinx *asian
reshape long cases_ deaths_, i(state) j(race) string
rename cases_ cases
rename deaths_ deaths
// turn race into a ordered factor
gen race_int = 1
replace race_int = 2 if race == "black"
replace race_int = 3 if race == "latinx"
replace race_int = 4 if race == "asian"
label define race_label 1 "White" 2 "Black" 3 "LatinX" 4 "Asian"
label values race_int race_label
drop race
rename race_int race
drop if cases == . | deaths == .
separate deaths, by(race) veryshortlabel
We will be Using the user written command grc1leg by Vince Wiggins to create the marginal plots.
grc1leg. Combine graphs into one graph with a common legend. / Program by Vince Wiggins, StataCorp vwiggins@stata.com. / Statalist distribution, 16 June 2003. / Distribution-Date: 02jun2010
Use the following line to install grc1leg:
net install grc1leg, from(http://www.stata.com/users/vwiggins/)
Note:
- Run the code chunk from ‘local alpha’ to ‘twoway histogram cases’ at once to make sure the local variables can be used by all three subplots.
- The optacity settings (e.g.
mcolor(red%30)) are only available in STATA15 or above. For STATA14 or below, try using hollow circles instead of solid ones by adding the optionmsymbol(Oh).
// create the main plot
local alpha %30
local xscale_opt xscale(range(0 415000))
local yscale_opt yscale(range(0 10500))
local scatter_opt mcolor(red`alpha' blue`alpha' green`alpha' purple`alpha')
scatter deaths? cases, `scatter_opt' ///
`xscale_opt' `yscale_opt' ///
legend(subtitle("Race") position(9) cols(1) region(style(none))) ///
ytitle("Deaths") xtitle("Cases") xlabel(, grid) ///
saving(yx_tr, replace)
// create the two marginal histograms
local hist_opt color(navy%30) bin(30)
twoway histogram deaths, `hist_opt' fraction ///
`yscale_opt' ysca(alt) horiz fxsize(25) ///
ytitle("") xlabel(, nogrid) ///
saving(hy_tr, replace)
twoway histogram cases, `hist_opt' fraction ///
`xscale_opt' xsca(alt) fysize(25) ///
xtitle("") xlabel(, grid) ylabel(, nogrid) ///
saving(hx_tr, replace)
// group the three plots together
local graph_title = "Total confirmed COVID-19 deaths vs. cases, " + ///
"U.S. States (11/01/20)"
grc1leg hx_tr.gph yx_tr.gph hy_tr.gph, ///
colfirst hole(3) imargin(0 0 0 0) ///
title("`graph_title'", size(medium)) ///
legendfrom(yx_tr.gph) position(9)
// save the final output
graph export stata-marginplot-tr.png, replace
// remove the intermediate files from disk
erase hx_tr.gph
erase yx_tr.gph
erase hy_tr.gph
// create the main plot
local alpha %30
local xscale_opt xscale(range(0 415000))
local yscale_opt yscale(range(0 10500))
local scatter_opt mcolor(red`alpha' blue`alpha' green`alpha' purple`alpha')
scatter deaths? cases, `scatter_opt' ///
`xscale_opt' `yscale_opt' ///
legend(subtitle("Race") position(3) cols(1) region(style(none))) ///
ytitle("Deaths") xtitle("Cases") ///
xlabel(, grid) ysca(alt) xsca(alt) ///
saving(yx_bl, replace)
// create the two marginal histograms
local hist_opt color(navy%30) bin(30)
twoway histogram deaths, `hist_opt' fraction ///
`yscale_opt' xsca(alt reverse) horiz fxsize(25) ///
ytitle("") xlabel(, nogrid) ///
saving(hy_bl, replace)
twoway histogram cases, `hist_opt' fraction ///
`xscale_opt' ysca(alt reverse) fysize(25) ///
xtitle("") xlabel(, grid) ylabel(, nogrid) ///
saving(hx_bl, replace)
// group the three plots together
local graph_title = "Total confirmed COVID-19 deaths vs. cases, " + ///
"U.S. States (11/01/20)"
grc1leg hy_bl.gph yx_bl.gph hx_bl.gph, ///
hole(3) imargin(0 0 0 0) ///
title("`graph_title'", size(medium)) ///
legendfrom(yx_bl.gph) position(3)
// save the final output
graph export stata-marginplot-bl.png, replace
// remove the intermediate files from disk
erase hy_bl.gph
erase yx_bl.gph
erase hx_bl.gph
Prior to any plotting, we’ll organize the csv file so it only contains the relevant data necessary for plotting.
%% Date Prep
% Read in data
fileName = 'Race Data Entry - CRDT.csv';
raceData = readtable(fileName);
%Select only 11/01/20 data
newestDate = 20201101;
raceData = raceData(raceData.Date == newestDate,:);
% Convert all case/death # to double
colnames = raceData.Properties.VariableNames;
for i = 3:length(colnames)
if iscell(raceData.(genvarname(colnames{i})))
raceData.(genvarname(colnames{i})) = ...
str2double(raceData.(genvarname(colnames{i})));
end
end
Narrowing down the dataset to only look at 4 different racial groups: White, Black, Latinx, and Asian.
%% Get race info
% Grabbing the different races identified in dataset
race_idx = find(contains(colnames, 'Deaths'));
race_idx = race_idx(2 : 10); % remove total and ethnicity info
races = raceData(end, race_idx);
raceID = strrep(races.Properties.VariableNames, 'Deaths_','');
% selecting relevant columns
plotData = raceData(1 : end, {'Date', 'State', 'Cases_White', ...
'Cases_Black', 'Cases_LatinX', 'Cases_Asian', 'Deaths_White', ...
'Deaths_Black', 'Deaths_LatinX', 'Deaths_Asian'});
% reshape case variable for plotting
case_col = startsWith(plotData.Properties.VariableNames, 'Cases');
cases = table2array(plotData(:, case_col));
cases = reshape(cases, [], 1);
% reshape death variable for plotting
death_col = startsWith(plotData.Properties.VariableNames, 'Deaths');
deaths = table2array(plotData(:, death_col));
deaths = reshape(deaths, [], 1);
MATLAB’s scatterhist plot requires that we pass in group label that is the same dimension as the x and y datapoints being plotted, so we create a new array containing matched labels.
% Generate race labels matching the column vector for cases/deaths
race_label = [];
for i = 1 : 4
race_label = [race_label; repmat(raceID(i), size(raceData,1), 1)];
end
Now that all the variables are ready, we use the scatterhist function to create a scatter plot with histogram plotted along the margins. Here, we specify group label colors to match colors used in R plot. We also specify the location where want the histograms to be plotted with respect to the scatter plot using argument pair ‘Location’, ‘NorthEast’ and the direction that we want the histogram to point by argument pair ‘Direction’, ‘out’.
x = cases;
y = deaths;
% Specify color to be used for each race grouping
red = [236, 95, 97] / 255;
blue = [115, 165, 205] / 255;
green = [131, 199, 129] / 255;
purple = [173, 113, 182] / 255;
c_array = [red; blue; green; purple];
figure
% scatterhist does not allow us to adjust the transparency of the plots,
% to do this we can generate marginal histogram from scratch using subplot
h = scatterhist(x, y, 'Group', race_label, 'Style', 'bar', ...
'marker', '.', 'Location', 'NorthEast', 'Direction','out', ...
'color', c_array, 'MarkerSize', 16);
[leg,~] = legend('show');
title(leg,'Race')
xlabel('Cases')
ylabel('Deaths')
sgtitle('Total confirmed COVID-19 deaths vs. cases, U.S. States (11/03/20)')
While the built-in function, scatterhist, is useful for quick plotting because many properties of the figure and the layout do not have to be specified, faces the downside of not being able to customize various settings which the user may be looking for. Alternatively, we may use subplot functionality to get around this issue which requires manually creating a marginal histogram. Note that this process will be much more tedious.
In MATLAB, we can also manually create a marginal histogram by arranging together 2 histograms and a scatterplot together into a single figure. Because we’re working with multiple, separate functions like hist and scatter with distinct handles for each group, we have much larger customizability of the arranged plots, where as scatterhist prevents users from grabbing all of figure’s property handles and modifying them, like transparency and bin width specific to each histogram.
We’ll use the toolbox ‘subaxis’ which builds off the built-in subplot functionality in order to minimize the margins between each subplots. This toolbox can be directly downloaded from link but it is also included in the repository containing this tutorial.
addpath('subaxis')
Working off the preprocessed data from earlier steps above, we now have to manually plot a histogram for each race and overlay them on top of one another. We do this two times for each of the two histograms (deaths and cases) that’s needed for a marginal histogram.
subaxis functions allows to arrange multiple plots into a single figure and additionally allows us to specify the margins surrounding each subplot.
%% Using subplot to create marginal histogram
% breaking race data into separate arrays
white = find(race_label == "White");
black = find(race_label == "Black");
latinx = find(race_label == "LatinX");
asian = find(race_label == "Asian");
figure(1)
clf
% y-data histogram
ah1 = subaxis(2, 2, 4, 'sh', 0, 'sv', 0.01, 'padding', 0, 'margin', 0.1);
p1 = histogram(deaths(white), 'Orientation', 'horizontal', ...
'Normalization', 'probability', 'BinWidth', 1000, ...
'Facecolor', red);
hold on
p2 = histogram(deaths(black), 'Orientation', 'horizontal', ...
'Normalization', 'probability', 'BinWidth', 1000, ...
'Facecolor', blue);
hold on
p3 = histogram(deaths(latinx), 'Orientation', 'horizontal', ...
'Normalization', 'probability', 'BinWidth', 1000, ...
'Facecolor', green);
hold on
p4 = histogram(deaths(asian), 'Orientation', 'horizontal', ...
'Normalization', 'probability', 'BinWidth', 1000, ...
'Facecolor', purple);
% x-data histogram
ah2 = subaxis(2, 2, 1,'sh', 0.03, 'sv', 0.03, 'padding', 0, 'margin', 0.1);
histogram(cases(white), 'Normalization', 'probability', ...
'BinWidth', 20000, 'Facecolor', red)
hold on
histogram(cases(black), 'Normalization', 'probability', ...
'BinWidth', 20000, 'Facecolor', blue)
hold on
histogram(cases(latinx), 'Normalization', 'probability', ...
'BinWidth', 20000, 'Facecolor', green)
hold on
histogram(cases(asian), 'Normalization', 'probability', ...
'BinWidth', 20000, 'Facecolor', purple)
The transparency of the scatter points can now be specified because this feature is supported through built-in function scatter, using the argument Markerfacealpha. I have specified 60% (denoted as 0.6) opacity for all the points.
% scatterplot
ah3 = subaxis(2, 2, 3, 'sh', 0.03, 'sv', 0.01, ...
'padding', 0, 'margin', 0.1);
hold on
scatter(cases(white), deaths(white), 'filled', ...
'MarkerFaceColor', red, 'Markerfacealpha', 0.6)
scatter(cases(black), deaths(black), 'filled', ...
'MarkerFaceColor', blue, 'Markerfacealpha', 0.6)
scatter(cases(latinx), deaths(latinx), 'filled', ...
'MarkerFaceColor', green, 'Markerfacealpha', 0.6)
scatter(cases(asian), deaths(asian), 'filled', ...
'MarkerFaceColor', purple, 'Markerfacealpha', 0.6)
We need to remove all the boxes and axes information from the histograms since we’re combining them with the scatterplot. Similar to function scatterhist‘s argument ’Direction’ specifying the direction which the histogram plots, View property of the subplots can be adjusted to rotate the histograms in the desired direction (commented out in the code below ah1.View = [180, -90]). Similarly, the arrangement of the subplot is reponsible for specifying the location of the histograms, whether they’re to the top or bottom, and left or right of the scatter plot.
% Remove boxes and axes information from the histogram
linkaxes([ah1, ah3], 'y')
linkaxes([ah3, ah2], 'x')
ah1.Box = 'off';
% ah1.View = [180, -90];
ah1.Visible = 'off';
ah2.Visible = 'off';
ah2.Box = 'off';
%ah2.View = [0, -90];
We finish off by creating a single universal legend for all three plots. Since this will not happen automatically when manually plotting and arranging a marginal histogram.
% Create single universal legend
h = [p1;p2;p3;p4];
hold on
lgd = legend(h, 'White', 'Black', 'Latinx', 'Asian');
lgd.Box = 'off';
newPosition = [0.5 0.6 0.13 0.13];
newUnits = 'normalized';
set(lgd, 'Position', newPosition, 'Units', newUnits);
title(lgd, 'Races')
% Give title and label axis in scatterplot
sgtitle('Total confirmed COVID-19 deaths vs. cases, U.S. States (11/03/20)')
xlabel('Cases')
ylabel('Deaths')
# setup
import numpy as np
import pandas as pd
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
df = pd.read_csv('./Data/Race Data Entry - CRDT.csv')
new = df[df['Date'] == 20201101]
new = new[['Cases_Asian', 'Cases_Black', 'Cases_LatinX','Cases_White',
'Deaths_Asian', 'Deaths_Black', 'Deaths_LatinX', 'Deaths_White']]
new = new.dropna(axis=0, how='any') # drop all rows that have any NaN values
new.head()
## Cases_Asian Cases_Black Cases_LatinX Cases_White Deaths_Asian \
## 4 2745.0 7621.0 75735.0 65521 68.0
## 5 36626.0 27719.0 401503.0 116385 2060.0
## 6 1879.0 3835.0 38963.0 42337 59.0
## 7 994.0 8329.0 13807.0 23943 50.0
## 9 350.0 6488.0 5379.0 10018 1.0
##
## Deaths_Black Deaths_LatinX Deaths_White
## 4 185.0 1793.0 2540.0
## 5 1308.0 8506.0 5269.0
## 6 124.0 488.0 1308.0
## 7 674.0 412.0 3373.0
## 9 183.0 50.0 463.0
left, width = 0.1, 0.65
bottom, height = 0.1, 0.65
spacing = 0.005
x1 = new['Cases_Asian']
y1 = new['Deaths_Asian']
x2 = new['Cases_Black']
y2 = new['Deaths_Black']
x3 = new['Cases_LatinX']
y3 = new['Deaths_LatinX']
x4 = [float(x) for x in new['Cases_White']]
y4 = new['Deaths_White']
rect_scatter = [left, bottom, width, height]
rect_histx = [left, bottom + height + spacing, width, 0.2]
rect_histy = [left + width + spacing, bottom, 0.2, height]
xmax = max(np.max(x1), np.max(x2), np.max(x3))
ymax = max(np.max(y1), np.max(y2), np.max(y3))
binsx = np.arange(0, xmax + 1000, 9000)
binsy = np.arange(0, ymax + 200, 200)
matplotlibfig = plt.figure(figsize=(10, 8))
ax = fig.add_axes(rect_scatter)
plt.xlabel('Infectious Number')
plt.ylabel('Death Number')
plt.grid(color='LightGrey')
blue_patch = mpatches.Patch(color='blue', label='Asian')
purple_patch = mpatches.Patch(color='purple', label='Black')
red_patch = mpatches.Patch(color='red', label='LatinX')
green_patch = mpatches.Patch(color='green', label='White')
ax_histx = fig.add_axes(rect_histx, sharex=ax)
ax_histy = fig.add_axes(rect_histy, sharey=ax)
plt.legend(handles=[blue_patch, purple_patch, red_patch, green_patch], loc = 'best')
ax.scatter(x1, y1, color = 'blue', alpha = 0.5)
ax_histx.hist(x1, bins=binsx, color = 'blue', alpha = 0.5)
ax_histy.hist(y1, bins=binsy, orientation='horizontal', color = 'blue', alpha = 0.5)
ax.scatter(x2, y2, color = 'purple', alpha = 0.5)
ax_histx.hist(x2, bins=binsx, color = 'purple', alpha = 0.5)
ax_histy.hist(y2, bins=binsy, orientation='horizontal', color = 'purple', alpha = 0.5)
ax.scatter(x3, y3, color = 'red', alpha = 0.5)
ax_histx.hist(x3, bins=binsx, color = 'red', alpha = 0.5)
ax_histy.hist(y3, bins=binsy, orientation='horizontal', color = 'red', alpha = 0.5)
ax.scatter(x4, y4, color = 'green', alpha = 0.5)
ax_histx.hist(x4, bins=binsx, color = 'green', alpha = 0.5)
ax_histy.hist(y4, bins=binsy, orientation='horizontal', color = 'green', alpha = 0.5)
plt.show()
In our project, we use bubble plot to see the relationship between stringency index, human development index, and total # of deaths in each country. Each bubble represents a single country with x-axis measuring the stringency index, y-axis measuring the human development index, and the size of the bubble measuring the total number of deaths in the country at the time of 10/20/20. Bubble plot is useful because it allows us to easily visualize 3 dimensions of the data rather than the conventional 2.
# setup
library(tidyverse)
filename_covid = "./data/owid-covid-data.csv"
plot2_data = read_delim(filename_covid, delim = ",") %>%
select(date, iso_code, continent, total_deaths_per_million,
stringency_index, human_development_index) %>%
filter(date == "2020-10-20")
| date | iso_code | continent | total_deaths_per_million | stringency_index | human_development_index |
|---|---|---|---|---|---|
| 2020-10-20 | ABW | North America | 318.453 | 50.46 | NA |
| 2020-10-20 | AFG | Asia | 38.455 | 16.67 | 0.498 |
| 2020-10-20 | AGO | Africa | 7.515 | 71.30 | 0.581 |
| 2020-10-20 | AIA | North America | NA | NA | NA |
| 2020-10-20 | ALB | Europe | 157.759 | 42.59 | 0.785 |
| date | iso_code | continent | total_deaths_per_million | stringency_index | human_development_index |
|---|---|---|---|---|---|
| 0/214 | 1/214 | 2/214 | 23/214 | 42/214 | 33/214 |
plot2_title = paste0(
"Total #of Deaths Across Gov. Stringency Index",
" and Human Development Index"
)
plot2_data %>%
filter(!is.na(continent)) %>% # remove 'NA' from the plot's legend
ggplot(aes(x = stringency_index, y = human_development_index,
size = total_deaths_per_million, color = continent)) +
theme_bw() +
geom_point(alpha = .5) +
scale_size_continuous(
name = "Total Death (Per Mil.)",
breaks = seq(100, 1300, 300),
range = c(2, 14)
) +
scale_color_brewer(
name = "Continent",
palette = "Set1"
) +
scale_x_continuous(
name = "Stringency Index",
breaks = seq(0, 100, 10)
) +
scale_y_continuous(
name = "Human Development Index",
breaks = seq(0, 1, 0.1)
) +
ggtitle(plot2_title) +
theme(
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5, size = 12),
axis.title = element_text(size = 10)
) +
guides(color = guide_legend(override.aes = list(size = 3))) # enlarge the symbol size in the color legend
import delimited "data\owid-covid-data.csv", clear
// keep only relevant variables & data points
keep if date == "2020-10-20"
keep iso_code continent total_deaths_per_million ///
stringency_index human_development_index
drop if continent == "" //drop international world data points
// Drop data pt. if either of the index measures are missing
drop if stringency_index ==. | ///
human_development_index==. | ///
total_deaths_per_million ==.
save covid_plot_data, replace
Note: The optacity settings (e.g.
mcolor(red%30)) are only available in STATA15 or above. For STATA14 or below, try using hollow circles instead of solid ones by adding the optionmsymbol(Oh).
use covid_plot_data, clear
separate human_development_index, by(continent) veryshortlabel
local vars human_development_index? stringency_index ///
[aw = total_deaths_per_million]
local symbol O
local size .75
local alpha %30
local options xscale(range(0 80)) yscale(range(0.32 1)) ///
mcolor(red`alpha' blue`alpha' green`alpha' ///
purple`alpha' orange`alpha' yellow`alpha') ///
msymbol(`symbol' `symbol' `symbol' `symbol' `symbol' `symbol') ///
msize(`size' `size' `size' `size' `size' `size')
local graph_title = ("Total # of Deaths Across Gov. Stringency Index " + ///
"and Human Development Index")
scatter `vars', `options' ///
legend(label(4 "N. America") label(6 "S. America") ///
subtitle("Continent") position(3) cols(1) ///
region(style(none)) ///
) ///
title("`graph_title'", size(small)) ///
ytitle("Human Development Index") xtitle("Government Stringency Index")
// save the plot
graph export stata-bubbleplot.png, replace
Bubble plot is a relatively new function (bubblechart) introduced to MATLAB in R2020b.
%% BubblePlots
% For ref see: https://www.mathworks.com/help/matlab/ref/bubblechart.html
%% Data Prep
fileName = './Data/owid-covid-data.csv';
covid_orig = readtable(fileName);
% Select 2020-10-20 Data
newestDate = datetime([2020 10 20]);
covid = covid_orig(covid_orig.date == newestDate,:);
Checking the # of NaN values for the stringency index:
sum(isnan(covid.stringency_index))
## ans = 40
Checking the # of NaN values for the human development index:
sum(isnan(covid.human_development_index))
## ans = 33
% Variables of interest (VOI)
voi = {'location', 'continent', 'total_cases_per_million', ...
'total_deaths_per_million','stringency_index',...
'human_development_index'};
% Remove international and world info and keep only variables of interest
covid = covid(1:end-2, voi);
We want to group the countries by its continent membership to see if there’re any patterns to certain continents being more stringent in COVID-19 restrictions or if certain continents are being impacted more heavily with COVID-19 deaths.
% Convert continent label to color label to be used in bubble plot
for k = 1 : size(covid, 1)
if strcmp(covid.continent(k), 'Africa')
covid.continent{k} = 1;
elseif strcmp(covid.continent(k), 'Asia')
covid.continent{k} = 2;
elseif strcmp(covid.continent(k), 'Europe')
covid.continent{k} = 3;
elseif strcmp(covid.continent(k), 'North America')
covid.continent{k} = 4;
elseif strcmp(covid.continent(k), 'Oceania')
covid.continent{k} = 5;
elseif strcmp(covid.continent(k), 'South America')
covid.continent{k} = 6;
end
end
covid = sortrows(covid, 2);
x = table2array(covid(:,5)); % x-axis: Stringency index
y = table2array(covid(:,6)); % y-axis: Human development index
sz = table2array(covid(:,4)); % size: Total deaths per million
c = cell2mat(table2array(covid(:,2))); % color: continent
In MATLAB, to create a bubble chart with different grouping colors, different groups have to be plotted separately and then overlaid on top of one another.
% Stringency index of countries grouped by continent
x1 = x(c==1); x2 = x(c==2); x3 = x(c==3);
x4 = x(c==4); x5 = x(c==5); x6 = x(c==6);
% Human development index of countries grouped by continent
y1 = y(c==1); y2 = y(c==2); y3 = y(c==3);
y4 = y(c==4); y5 = y(c==5); y6 = y(c==6);
% Total Deaths per mil. of countries grouped by continent
sz1 = sz(c==1); sz2 = sz(c==2); sz3 = sz(c==3);
sz4 = sz(c==4); sz5 = sz(c==5); sz6 = sz(c==6);
%% Plot data in a tiled chart layout
figure
% colors for plotting
red = [234, 82, 84]/255;
blue = [105, 158, 201]/255;
green = [99, 185, 96]/255;
purple = [177, 122, 186]/255;
orange = [255, 186, 117]/255;
yellow = [255, 255, 153]/255;
t = tiledlayout(1,1);
nexttile
bubblechart(x1, y1, sz1, red) % Africa
hold on
bubblechart(x2, y2, sz2, blue) % Asia
hold on
bubblechart(x3, y3, sz3, green) % Europe
hold on
bubblechart(x4, y4, sz4, purple) % N. America
hold on
bubblechart(x5, y5, sz5, orange) % Oceania
hold on
bubblechart(x6, y6, sz6, yellow) % S. America
hold off
sgtitle('Total # of Deaths Across Gov. Stringency Index and Human Development Index')
xlabel('Stringency Index')
ylabel('Human Development Index')
We want to include both legend for the continents and the legend for the size of the bubbles (bubblelegend). In order to include these 2 separate legends, we have to specify a layout of the tile.
blgd = bubblelegend('Total Death (Per Mil.)','Location','eastoutside');
lgd = legend('Africa','Asia', 'Europe', 'North America', ...
'Oceania', 'South America');
title(lgd, 'Continent')
blgd.Layout.Tile = 'east';
lgd.Layout.Tile = 'east';
While the default value of transparency of the bubbles is set at 60% opacity, since we have large number of data points, we’ll bring it down to 40% so it’s easier to see all the points.
b1.MarkerFaceAlpha = 0.4; b2.MarkerFaceAlpha = 0.4;
b3.MarkerFaceAlpha = 0.4; b4.MarkerFaceAlpha = 0.4;
b5.MarkerFaceAlpha = 0.4; b6.MarkerFaceAlpha = 0.4;
We can further specify the color of the bubbles for each group with the pair option parameter ‘bubble color’ by inputting it into as an in put parameter in the bubblechart function call:
ex1 = bubblechart(x1,y1,sz1, "red");
Or specify the color (in RGB value array) once the plot has been created by assigning its property ‘CData’:
ex2 = bubblechart(x1,y1,sz1, "red");
ex2.CData = [0 0 0];
# setup
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
df2 = pd.read_csv("./Data/owid-covid-data.csv")
new2 = df2[df2['date'] == '2020-10-20']
new2 = new2[['continent', 'total_deaths_per_million',
'stringency_index', 'human_development_index']]
new2 = new2.dropna(axis=0, how='any')
new2.head()
## continent total_deaths_per_million stringency_index \
## 527 Asia 38.455 16.67
## 756 Africa 7.515 71.30
## 1222 Europe 157.759 42.59
## 1465 Europe 802.433 53.70
## 1776 Asia 47.116 52.78
##
## human_development_index
## 527 0.498
## 756 0.581
## 1222 0.785
## 1465 0.858
## 1776 0.863
plt.figure(figsize=(11, 7))
sns.scatterplot(x='stringency_index',
y='human_development_index',
size=new2['total_deaths_per_million'],
sizes=(10,1300),
alpha=0.6,
hue=new2['continent'],
data=new2)
# Put the legend out of the figure
plt.legend(bbox_to_anchor=(1.03, 1),borderaxespad=0,
prop={'size':12}, handlelength=2,
frameon=False)
plt.rcParams.update({'legend.labelspacing':1.85})
plt.xlabel('Stringency Index')
plt.ylabel('Human Development Index')
plt.title('Total # of Death Across Stringency Index & Human Development Index')
plt.grid(color='LightGrey')
plt.tight_layout()
plt.savefig("Bubble_plot_Seaborn_color_by_variable_Seaborn_scatterplot.png",
format='png',dpi=150)
continent_list = ['Africa', 'Asia', 'Europe', 'North America', 'Oceania', 'South America']
color_list = ['red', 'blue', 'green', 'purple', 'gold', 'pink']
colors = {'Africa':'red', 'Asia':'blue', 'Europe':'green',
'North America':'purple', 'Oceania': 'gold',
'South America': 'pink'}
l = []
for i in range(6):
l.append(mpatches.Patch( color=color_list[i],
alpha=0.7,
label=continent_list[i]))
plt.figure(figsize = (9,8))
plt.scatter('stringency_index',
'human_development_index',
s='total_deaths_per_million',
c=new2['continent'].apply(lambda x: colors[x]),
alpha=0.7,
data=new2)
plt.grid(color='LightGrey')
plt.xlabel('Stringency Index')
plt.ylabel('Human Development Index')
plt.title('Total # of Death Across Stringency Index & Human Development Index')
legend1 = plt.legend(handles=l, loc = (1.02,0.78))
plt.show()
Interactive plots allow us to manipulate the viewing of the data points plotted and can be useful in focusing on specific aspect of the data. In this example, we will plot an interactive histogram comparing the # of cases, deaths, tests performed, and hospitalized across a few different countries in Europe (dataset). Clicking the legend in the plot will either hide or show the bar associated with the specific item in the legend.
# setup
library(tidyverse)
filename_covid = "./data/owid-covid-data.csv"
variable_list = c(
"total_cases", "total_deaths",
"hosp_patients", "total_tests"
) # this is used to create ordered factor, which will then affect
# the order of variables displayed in the bar plot
plot3_data = read_delim(
filename_covid, delim = ",",
col_types = cols(
date = col_character(),
iso_code = col_character(),
location = col_character(),
total_cases_per_million = col_double(),
total_deaths_per_million = col_double(),
hosp_patients_per_million = col_double(),
total_tests_per_thousand = col_double()
)
) %>%
select(date, iso_code, location,
total_cases_per_million,
total_deaths_per_million,
hosp_patients_per_million,
total_tests_per_thousand) %>%
filter(
iso_code %in% c("AUT", "BEL", "BGR", "CZE", "DNK"),
date == "2020-10-20"
) %>%
mutate(
total_tests_per_thousand = total_tests_per_thousand * 1e03
# turn 'per thousand' to 'per million'
) %>%
pivot_longer(
cols = total_cases_per_million:total_tests_per_thousand,
names_pattern = "(.*)_per",
names_to = "variable"
) %>%
mutate(
variable = factor(variable, levels = variable_list, ordered = TRUE)
)
| date | iso_code | location | variable | value |
|---|---|---|---|---|
| 2020-10-20 | AUT | Austria | total_cases | 7395.963 |
| 2020-10-20 | AUT | Austria | total_deaths | 102.039 |
| 2020-10-20 | AUT | Austria | hosp_patients | 82.608 |
| 2020-10-20 | AUT | Austria | total_tests | 218961.000 |
| 2020-10-20 | BEL | Belgium | total_cases | 22606.875 |
| date | iso_code | location | variable | value |
|---|---|---|---|---|
| 0/20 | 0/20 | 0/20 | 0/20 | 0/20 |
Using library ggplotly, we can easily add interactivity to a ggplot object, and the interactivity will be retained in an HTML output of Rmarkdown.
Try the following:
plot3_title = paste0(
"Comparison of Cases, Deaths, Hospitalizations, and Testing"
)
plot3_base = plot3_data %>%
ggplot(aes(x = location, y = value,
fill = variable)) +
theme_bw() +
geom_col(position = "dodge", width = .7) +
scale_fill_brewer(
name = "Variable",
palette = "Paired"
) +
scale_x_discrete(name = "Countries") +
ggtitle(plot3_title) +
theme(
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5, size = 12),
axis.title = element_text(size = 10)
)
{plot3_base +
scale_y_continuous(name = "# per million")} %>%
plotly::ggplotly()
Due to the large data range, we use a log10 scale to transform the y-axis value to ensure all variables of interest are visible.
Note that when the mouse hovers over a bar, the value presented will still be the original value before scaling. That way we can know the exact value without referencing y-axis labels.
{plot3_base +
scale_y_continuous(name = "log10(# per million)", trans = "log10") +
theme(
axis.text.y = element_blank() # remove the y-axis label
# because they are not helpful due to the large data range
)
} %>%
plotly::ggplotly()
Note that the legend position is higher than static plots. We can move the legend items downwards using plotly::layout(), and move the legend title downwards by (i) hiding the original legend title, and (ii) add a new legend title using plotly::add_annotations.
{plot3_base +
scale_y_continuous(name = "log10(# per million)", trans = "log10") +
theme(
axis.text.y = element_blank(),
legend.title = element_blank() # hide the legend title
)
} %>%
plotly::ggplotly() %>%
# move legend downwards
plotly::layout(legend=list(y=0.8, yanchor="top")) %>%
# add the legend title back but to a lower position
plotly::add_annotations(
text="Variable",
xref="paper", x=1.02, xanchor="left",
yref="paper", y=0.8, yanchor="bottom",
legendtitle=TRUE,
showarrow=FALSE
)
Clickable Legend Toolbox was downloaded from MATLAB FileExchange in order to make the legend interactive. However, it should be noted that the original toolbox had a small bug associated with a coloring issue in the function (togglevisibility) that had to be corrected. The updated version of the toolbox is available in the project repository on github.
%% Data Prep
addpath('clickable_legend')
fileName = 'owid-covid-data.csv';
covid_orig = readtable(fileName);
% Select 2020-10-20 Data
newestDate = covid_orig.date(217);
covid_hosp = covid_orig(covid_orig.date == newestDate,:);
% Keep only countries that have hospitalization data
covid_hosp = covid_hosp(~isnan(covid_hosp.hosp_patients_per_million),:);
% Plotting the following variables:
% # of deaths, # of cases, # of tests, # hospitalizations
voi = {'location', 'total_cases_per_million', ...
'total_deaths_per_million','hosp_patients_per_million',...
'total_tests_per_thousand'};
% Update dataset to only include variables of interest
covid_hosp = covid_hosp(:, voi);
% Convert total_tests_per_thousand to total_tests_per_million
covid_hosp(:,'total_tests_per_thousand') = ...
table(1000*covid_hosp.total_tests_per_thousand);
covid_hosp.Properties.VariableNames{5} = 'total_tests_per_million';
Clickable Legend functionality can be added to any plots that has a unique legend entry. In our case, we use bar graphs to compare the number of cases, deaths, hospitalization, and testing across the 5 different European countries. However, it can be added to line plots and scatter plots (by getting rid of the line and specifying marker shape and size).
%% Plotting
nCountries = 5;
X = categorical(covid_hosp.location(1:nCountries));
Y = [covid_hosp{1:nCountries,2}, covid_hosp{1:nCountries,3},...
covid_hosp{1:nCountries,4}, covid_hosp{1:nCountries,5}];
figure
bar(X,Y)
Once the base plot has been plotted, we add the interactive functionality by calling function ‘clickableLegend’ and enable the toggling of the visibility of the bars by clicking on the text shown in the legend. In calling the function, we specify the name of each legend entry and we can specify the optional location of the legend, through passing of ‘Location’ argument pair (in this case ‘Location’, ‘eastoutside’).
Note: The y-axis was log scaled for better comparison of values across different statistics.
clickableLegend({'Case','Death','Hosp','Test'}, 'Location', 'eastoutside');
title('Comparison of Cases, Deaths, Hospitaliations, and Testing')
xlabel('Countries')
ylabel('log10(# per million)')
set(gca, 'YScale', 'log')
Unfortunately, MATLAB figure does not retain its interactive functionality through an html export, the example code will have to be executed in MATLAB. The .m script of this code is available in the project github repository.
[To be included in final report]:
# setup
import numpy as np
import plotly.express as px
import plotly.offline as offline
import plotly.graph_objs as go
# will be using the same dataset as in the bubble plot section
new3 = df2[df2['date'] == '2020-10-20']
new3 = new3[['location', 'total_cases_per_million', 'total_deaths_per_million',
'hosp_patients_per_million', 'total_tests_per_thousand']]
new3 = new3.dropna(axis=0, how='any')[0:5]
new3.head()
## location total_cases_per_million total_deaths_per_million \
## 3186 Austria 7395.963 102.039
## 4027 Belgium 22606.875 906.156
## 5204 Bulgaria 4393.357 145.068
## 12758 Czech Republic 16991.531 141.283
## 13840 Denmark 6188.319 118.435
##
## hosp_patients_per_million total_tests_per_thousand
## 3186 82.608 218.961
## 4027 256.178 370.342
## 5204 224.798 88.278
## 12758 412.458 176.115
## 13840 22.099 806.484
branches = new3['location']
total_cases_per_million = np.log10(new3['total_cases_per_million'])
total_deaths_per_million = np.log10(new3['total_deaths_per_million'])
hosp_patients_per_million = np.log10(new3['hosp_patients_per_million'])
total_tests_per_thousand = np.log10(new3['total_tests_per_thousand']*1000)
trace1 = go.Bar(
x = branches,
y = total_cases_per_million,
name = 'total_cases_per_million'
)
trace2 = go.Bar(
x = branches,
y = total_deaths_per_million,
name = 'total_deaths_per_million'
)
trace3 = go.Bar(
x = branches,
y = hosp_patients_per_million,
name = 'hosp_patients_per_million'
)
trace4 = go.Bar(
x = branches,
y = total_tests_per_thousand,
name = 'total_tests_per_million'
)
data = [trace1, trace2, trace3, trace4]
layout = go.Layout(title = 'Comparison of Cases, Deaths, Hospitalizations, and Testings', barmode = 'group')
fig = go.Figure(data = data, layout = layout)
fig.update_xaxes(title_text='Countries')
fig.update_yaxes(title_text='Log10 (# per million)')
fig.update_layout
offline.plot(fig, filename='interactive.html')
Data visualization can prove to be an immensely useful tool when trying to understand and organize large amounts of data. In this tutorial, we reviewed a few graphical concepts, marginal histogram, bubble plot, and interactive plots to demonstrate the functionality of these different graphical concepts and how they may facilitate a better grasp of the pattern captured in the data.
The tutorial is carried out in 4 different programming languages (R, STATA, Python, MATLAB). When it comes to data visualization, the two open-source software, R and Python take the clear lead in flexibility and functionality over the two proprietary languages, STATA and MATLAB. The open-source nature of R and Python has allowed the development of a vast number of user written libraries capable of producing publication quality plots with large customizability of the plots.
However, at the cost of less customizability on part of the user, plotting in MATLAB and STATA tends to be a little more succinct and straightforward, with many default parameters in place that do not need to be specified. We also find that STATA’s macros come in really handy when we want to unify settings across multiple figures without writing duplicated code.
However, we would note that once we begin to try to implement the same level of specifications in the plots as these open-source software, MATLAB and STATA lose this advantage. For example, with R’s ggplot2, we can easily change aesthetic options. If we want to enlarge the symbol size in a color legend while keeping the symbols in the main plot untouched, simply using guides(color = guide_legend(override.aes = list(size = 3))) will do. The same task can take much more effort in STATA: we have to create another layer of scatterplot that does not have any valid data points (so the plotting area will be empty) but has larger symbol size (for the legend). See the code example below:
sysuse auto
twoway /// the 3rd and 4th layers are empty scatter plots with larger msize
(scatter mpg weight if foreign==0, msymbol(o) mcolor(orange)) ///
(scatter mpg weight if foreign==1, msymbol(o) mcolor(green)) ///
(scatter mpg weight if mpg==., msymbol(o) mcolor(orange) msize(*3)) ///
(scatter mpg weight if mpg==., msymbol(o) mcolor(green) msize(*3)), ///
legend(order(3 4) label(3 domestic) label(4 foreign))
Python contains a lot of libraries, which allows different implementations of the same plot type. For instance, the bubble plot can be implemented using either matplotlib or Seaborn. Interestingly, the matplotlib package inherits lots of nice properties from MATLAB. Hence it is possible to make plots similar to MATLAB using this package.
It is a little bit difficult to implement complicated legend in Python using matplotlib (e.g. the size legend in the bubble plot). The same graph in Python can take more lines of code than using MATLAB. It is also worth noting that Python programs become much slower when the data size gets larger.